Abstract

This paper proposes a novel statistical approach to intelligent document re- 
trieval. It seeks to offer a more structured and extensible mathematical 
approach to the term generalization done in the popular Latent Semantic 
Analysis (LSA) approach to document indexing. A Markov Random Field 
(MRF) is presented that captures relationships between terms and docu- 
ments as probabilistic dependence assumptions between random variables. 
From there, it uses the MRF-Gibbs equivalence to derive joint probabilities 
as well as local probabilities for document variables. A parameter learning
method is proposed that utilizes rank reduction with singular value
decomposition, in a manner similar to LSA, to reduce the dimensionality of
document-term relationships to that of a latent topic space. Experimental
results confirm the ability of this approach to effectively and efficiently
retrieve documents from substantial data sets.

1 Introduction

Research in the field of information retrieval is becoming increasingly
important as large sources of data become available and users become
accustomed to powerful and flexible ways of processing this information. It is
now accepted that simple data retrieval methods based on naive term matching
fail to function effectively for large and varied bodies of data [1]. In
particular, users are beginning to seek methods of retrieval that examine the
meanings of queries rather than the queries themselves. One promising
approach to this, Latent Semantic Analysis (LSA), was proposed by [2] as
an attempt to generalize terms into latent topic concepts using linear
algebra techniques. We seek to provide a more structured approach to
accomplishing term generalization similar to LSA using a Markov Random Field
model. We believe that this approach has a more solid foundation and
provides researchers with a better understanding of the underlying mathematics
and potential for extension.



1.1 Related Work 

1.1.1 Latent Semantic Analysis 

Latent Semantic Analysis (LSA) is a method used in information retrieval
for smoothing sets of document-term data. Documents in a large collection
are subject to statistical over-specification, as each one only contains a
small fraction of the terms despite being relevant with respect to many other
terms. LSA expands upon a vector-space model [3] in which documents are
represented as vectors of term counts. A co-occurrence matrix X representing
a collection of documents can be defined as a matrix whose rows are term
vectors T and columns are document vectors D.



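$$
X = \begin{bmatrix} x_{1,1} & \cdots & x_{1,m} \\ \vdots & \ddots & \vdots \\ x_{n,1} & \cdots & x_{n,m} \end{bmatrix}
$$

The value $x_{t,d}$ refers to the number of times term t appears in document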
d. This representation is convenient because it allows the similarity of any 
column vector d of matrix X and query vector q to be calculated as the 
cosine of the angle between the two vectors using: 

$$
\cos(\theta) = \frac{d \cdot q}{\|d\| \, \|q\|} \qquad (1)
$$
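As a minimal sketch of the cosine ranking described above, the following Python fragment scores the columns of a small, made-up term-document matrix against a query. The matrix values, the query, and the function name `cosine_similarity` are illustrative assumptions and do not come from the experiments in this paper.

```python
import numpy as np

def cosine_similarity(d, q):
    """Cosine of the angle between document vector d and query vector q."""
    d = np.asarray(d, dtype=float)
    q = np.asarray(q, dtype=float)
    denom = np.linalg.norm(d) * np.linalg.norm(q)
    # An all-zero vector has no direction; its similarity is treated as 0.
    return float(d @ q / denom) if denom > 0 else 0.0

# Toy co-occurrence matrix (assumed data): rows are terms, columns are documents.
X = np.array([[2, 0, 1],
              [1, 1, 0],
              [0, 3, 1]])
q = np.array([1, 1, 0])  # query containing the first two terms once each

scores = [cosine_similarity(X[:, j], q) for j in range(X.shape[1])]
# Document indices ordered from most to least similar to the query.
ranking = sorted(range(X.shape[1]), key=lambda j: scores[j], reverse=True)
```

Because only shared terms contribute to the dot product, a document that uses a synonym of a query term receives no credit for it, which is the weakness that the latent space methods discussed next try to address.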

One problem with this approach is that, since it treats terms
as independent, it fails to capture the semantic relationship between
synonyms and other examples of distinct but related terms. It also results in
poor and uneven recall because it relies on the specific wording of the query,
and, without any smoothing, many relevant documents could be missed due
to lexical discrepancies.

LSA attempts to generalize terms into a latent topic space by reducing
the dimensionality of the co-occurrence matrix. This is accomplished by
first taking a Singular Value Decomposition of the co-occurrence matrix.
This produces three new matrices, U, S, and V, such that $X = U S V^T$. U
and V contain orthogonal column vectors while S is a diagonal matrix. The
diagonal of S forms a vector of singular values $\sigma$.

To reduce the dimensionality of the matrix, a number k of the singular
values are kept, and the rest are discarded. The number of singular values
to keep is arbitrary, but implementations almost always keep large singular
values ($\sigma_j > 3$ or so) and discard small ones ($\sigma_j < 0.5$). Intuitively, these
larger values are important to the document collection, while smaller ones
only serve to contribute to the over-specification.

The product of the resulting matrices $U_k$, $S_k$, and $V_k^T$ produces a
dimensionally reduced co-occurrence matrix $X_k = U_k S_k V_k^T$.
Here, the vectors $u_i = (u_{i,1}, \ldots, u_{i,n})$ and $v_i = (v_{i,1}, \ldots, v_{i,m})$ are left
and right singular row vectors for X.
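The rank reduction can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the implementation used in the experiments: the matrix is random toy data, and `truncated_svd` is a hypothetical helper name.

```python
import numpy as np

def truncated_svd(X, k):
    """Keep the k largest singular values of X and rebuild the rank-k matrix X_k."""
    # np.linalg.svd returns singular values sorted in descending order.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    X_k = U_k @ np.diag(s_k) @ Vt_k
    return U_k, s_k, Vt_k, X_k

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(6, 4)).astype(float)  # toy term-document counts
U_k, s_k, Vt_k, X_k = truncated_svd(X, k=2)
```

The squared Frobenius distance between `X` and `X_k` equals the sum of the squared singular values that were discarded, which is why dropping small singular values changes the matrix only slightly.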

To compare documents and terms in this new latent space, it must be
shown that there exists an analog in LSA to the inner product used for finding
the similarity in the original vector space model. The dot products between
all documents in the collection are calculated with $X^T X$. The following
manipulations [2] show that this is equivalent to the following latent space
concept:

$$
X^T X = (U S V^T)^T U S V^T = V S U^T U S V^T = V S S V^T = (V S)(V S)^T \qquad (2)
$$

This means that document comparison is now possible by using the inner
products of rows from the $VS$ matrix from equation (2).

A comparison among terms is done similarly, by first taking:

$$
X X^T = U S V^T (U S V^T)^T = U S V^T V S U^T = U S S U^T = (U S)(U S)^T \qquad (3)
$$

The inner products of rows from the $US$ matrix of equation (3) allow terms to
be compared.
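Equations (2) and (3) can be checked numerically. The sketch below uses a random matrix as stand-in data and verifies that the Gram matrices of the original space coincide with those of the latent coordinates; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((5, 3))  # toy matrix: 5 terms (rows), 3 documents (columns)

# Thin SVD, so that X = U @ S @ V.T holds exactly.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
S = np.diag(s)
V = Vt.T

doc_coords = V @ S   # each row: one document in the latent space
term_coords = U @ S  # each row: one term in the latent space

# Equation (2): document-document inner products.
doc_gram_ok = np.allclose(X.T @ X, doc_coords @ doc_coords.T)
# Equation (3): term-term inner products.
term_gram_ok = np.allclose(X @ X.T, term_coords @ term_coords.T)
```

With all singular values kept the identities are exact; after truncation to k values the same products give the inner products of the reduced matrix $X_k$ instead.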

Finally, a query is represented as a new document vector q containing
the term counts found in the query. This can be transformed into the latent
space as $\hat{q} = q^T U_k S_k^{-1}$. The previously mentioned method for comparing
documents in latent space can now be utilized to rank documents.
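A small sketch of query folding is given below. It assumes a term-by-document matrix, folds the query as $q^T U_k S_k^{-1}$, and compares the result with the rows of $V_k$ by cosine; the helper names `fold_in_query` and `rank_documents` and the random data are assumptions made for illustration.

```python
import numpy as np

def fold_in_query(q, U_k, s_k):
    """Map a term-count query vector into the k-dimensional latent space."""
    return (q @ U_k) / s_k

def rank_documents(q, U_k, s_k, Vt_k):
    """Rank documents by cosine between the folded query and the rows of V_k."""
    q_hat = fold_in_query(q, U_k, s_k)
    docs = Vt_k.T  # each row: one document in latent coordinates
    norms = np.linalg.norm(docs, axis=1) * np.linalg.norm(q_hat)
    # Guard against zero-length vectors instead of dividing by zero.
    cos = np.divide(docs @ q_hat, norms, out=np.zeros(len(docs)), where=norms > 0)
    return np.argsort(-cos), cos

rng = np.random.default_rng(2)
X = rng.random((5, 3))  # toy matrix: 5 terms, 3 documents
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
# Using an existing document as the query should rank that document first.
order, cos = rank_documents(X[:, 1], U[:, :k], s[:k], Vt[:k, :])
```

Folding a column of X reproduces the corresponding row of $V_k$, so an existing document used as a query has cosine 1 with itself.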

LSA is a useful technique for improving the quality of query results, but
it suffers from a weak mathematical foundation that does not provide a solid
set of statistical assumptions about its operations. As it does not specify
any kind of generative model, it produces no clear normalized probability
distribution, and instead focuses on finding a rank k matrix that minimizes
the Frobenius norm error with the co-occurrence matrix. While using Singular
Value Decompositions with limited singular values has been shown
[4] to always produce such a rank k matrix, there is not much room to
expand the retrieval model to include concepts like query expansion and term
dependence.

1.1.2 Statistical Approaches 

Probabilistic Latent Semantic Analysis (PLSA) is a way of providing a more
structured approach to the problem of identifying latent concepts [5]. PLSA
takes a stronger statistical approach by constructing a generative model for
the data.

PLSA represents documents and terms as vectors D and W, and uses an
aspect model that associates an unobserved class variable $z \in Z$ with observed
documents. The joint distribution is represented as:

$$
P(d, w) = P(d) \sum_{z \in Z} P(w \mid z) P(z \mid d)
$$

The generative model is then fitted through maximum likelihood with
the Expectation Maximization (EM) algorithm.
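To make the aspect model's joint distribution concrete, the sketch below evaluates $P(d, w)$ for hand-picked probability tables. All numbers are invented for illustration, and no EM fitting is performed.

```python
import numpy as np

# Assumed toy distributions: 3 documents, 2 latent aspects, 4 words.
P_d = np.array([0.5, 0.3, 0.2])                # P(d)
P_z_given_d = np.array([[0.9, 0.1],            # P(z|d), one row per document
                        [0.2, 0.8],
                        [0.5, 0.5]])
P_w_given_z = np.array([[0.7, 0.2, 0.1, 0.0],  # P(w|z), one row per aspect
                        [0.0, 0.1, 0.3, 0.6]])

# P(d, w) = P(d) * sum_z P(w|z) P(z|d); the matrix product performs the sum over z.
P_dw = P_d[:, None] * (P_z_given_d @ P_w_given_z)
```

Each row of `P_dw` sums to the corresponding $P(d)$, so the table as a whole is a normalized joint distribution over document-word pairs.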

One improvement to PLSA called Latent Dirichlet Allocation (LDA) 
was proposed which seeks to capture more of the document collection's de- 
pendence relationships. Specifically, LDA takes a Bayesian approach and 
performs inference with prior distributions for terms and documents. In par- 
ticular, this method gives more generalization, as it constructs a true gener- 
ative model that represents both seen and unseen documents. 

Both LDA and PLSA reevaluate the mathematical underpinnings of LSA 
for Information Retrieval, but do so by discarding the linear algebra ap- 
proach of LSA in favor of a different, more structurally sound statistical 
model.

1.1.3 Information Retrieval with Markov Random Fields

The task of expanding the basic vector space model was approached by [7]
with a formal Markov Random Field framework. In this approach, three
methods were offered for modeling term dependencies: independent,
sequential, and fully dependent. The suggested approach was the sequential
dependency graph, containing cliques representing documents, terms,
ordered term sequences, and unordered term sequences. Documents are ranked
under sequential dependencies with this ranking function:

$$
P(D \mid Q) \stackrel{rank}{=} \sum_{c \in T} \lambda_T f_T(c) + \sum_{c \in O} \lambda_O f_O(c) + \sum_{c \in O \cup U} \lambda_U f_U(c) \qquad (4)
$$

The functions $f_T$, $f_O$, and $f_U$ are clique potential functions
representing the compatibility of a clique in the given distribution. The set of weights
$(\lambda_T, \lambda_O, \lambda_U)$ is then learned by using a hill climbing search to optimize the
mean average precision. It was shown [8] that the surface is concave, so
finding a global maximum is likely. Clique functions utilize simple smoothing
based on a Dirichlet prior to help generalize the term-document space.

This approach uses Markov Random Fields (MRF) as a model for
producing the weighted sum of functions relating terms and documents in
equation (4). It is important to note that, since it is simply another way of
stating common information retrieval formulas, this is not by itself a major
advance in information retrieval. Its real value lies instead in the firm
foundation that it provides for applying those formulas, as it specifies both the
conditional assumptions made by the equations themselves as well as the
method for applying them together. Because it provides such a solid
framework for MRF-based document retrieval, its authors successfully build upon
this foundation with extensions describing implicit user preference [9],
feature selection [10], and latent concept expansion [11].

1.2 Overview of MRF Topic Identification 

In order to achieve the level of flexibility and extensibility achieved by [7] in
that MRF model, we propose another MRF that seeks to capture the
smoothing gained from the reduced dimensionality co-occurrence matrix in LSA.
A general method for defining an MRF will be outlined and applied to a
term-document dependency graph. A learning strategy will then demonstrate
that LSA's topic clustering can be achieved with the general term-document
MRF approach.

2 Theory



2.1 Markov Random Fields 

MRFs provide a flexible framework for depicting conditional relationships 
between a set of random variables. Unlike similar models such as Markov 
Chains and Bayesian Networks, MRFs are not limited to specifying one-way 
(or causal) links between random variables. 

2.1.1 Definition 

MRFs represent a group of random variables with symmetric neighbor
relations that satisfy a set [12] of conditions:

• The probability of any variable given the rest of the MRF is equal to 
the probability of that variable given its neighbors. 

• The probability of any set of random variables in the MRF is greater 
than zero. 

The first condition, the Markov property for the MRF, means that com- 
paring probabilities is much simpler, since many of the random variables can 
be ignored when the one being considered does not depend on them. The 
second condition simply limits local probabilities to an open interval (0, 1). 

To obtain a global distribution for random variables in an MRF, it is first
necessary to demonstrate the equivalence between the MRF and the Gibbs
distribution [12]. This can be shown with the Hammersley-Clifford theorem.
This theorem states that given the random vector x, a collection of graph
dependencies G consisting of dependencies based on a symmetric neighbor
relation $\omega \subseteq x \times x$, and a set of maximal cliques C on this graph, the random
vector is an MRF if and only if it has a joint probability distribution of the form:

$$
P(x) = \frac{1}{Z} e^{-V(x)}
$$

Here Z is a normalization constant that is generally infeasible to
calculate, and $V(x)$ refers to a family of potential functions that describe the
compatibility of clique structures on x. This equivalence, known as the
Hammersley-Clifford theorem, while never published, was proven in later
publications [13].

2.1.2 Constructing an MRF Model 

Define a Graph Structure

The first step in constructing an MRF is to produce a graph G that contains
a vector of random variables x that satisfies the positivity assumption. This
assumption may be restated from its previous definition to say that each
random vector may occur with a nonzero probability. In practice, this constraint
is easily met with a well constructed graph.

Define Clique Structure 

While factorizing the maximal cliques in a given graph has been shown 
to be NP-complete [14], a well-designed structure can lead to an easily ob- 
tainable and semantically meaningful set of cliques. 

Write Clique Potential Functions 

Once clique structures have been defined, it is now necessary to define 
clique potential functions for them. These potential functions represent the 
compatibility of the clique for the particular distribution. 

The individual clique potential functions combine as:

$$
V(x) = \sum_{c \in C} V_c(x) \qquad (5)
$$

where C is a family of clique configurations and $V_c(x)$ refers to the
potential function defined for clique configuration c.

Obtain Joint Distribution

The Hammersley-Clifford Theorem now allows the joint distribution
over x to be defined as:

$$
P(x) = \frac{e^{-V(x)}}{Z} \qquad (6)
$$

Applying equation (5) to equation (6) produces:

$$
P(x) = \frac{e^{-\sum_{c \in C} V_c(x)}}{Z} \qquad (7)
$$

Defining Z as $Z = \sum_{y \in S} e^{-V(y)}$, where S is the set of all MRF
configurations for x, the joint distribution can be written as:

$$
P(x) = \frac{e^{-\sum_{c \in C} V_c(x)}}{\sum_{y \in S} e^{-\sum_{c \in C} V_c(y)}} \qquad (8)
$$

Provide Learning Strategy 

The last step is to define a method for learning MRF parameters. An
example of one learning strategy is the hill climbing approach taken by Metzler
to optimize the weights given to the clique potential functions in equation (4).

Figure 1: Example Term Document Graph Structure



2.2 An MRF Model for Information Retrieval 

With these steps defined, it is now possible to construct a MRF model for 
representing LSA in Information Retrieval. 

2.2.1 Graph Structure 

The random variables in the MRF, one for each term and one for each
document, will be binary-valued random variables. This choice leads to the
concise clique functions and probability calculation done in 2.2.5 and 2.2.6.
For brevity, it is often convenient to represent the collection of term
variables as a row vector T and the collection of document variables as a column
vector D.

$$
T = [t_1, \ldots, t_n] \qquad (9)
$$

$$
D = [d_1, \ldots, d_m]^T \qquad (10)
$$

Now that the variables in the MRF have been defined, it is necessary to
supply neighbor relations $\omega$ on our graph G representing conditional
dependence. For this graph structure, each document will be connected to every
term, and each term will be connected to every document. In this
design, the t nodes represent the pool of terms in our collection, while the d
nodes represent the documents containing one or more of those terms.

Figure 1 gives an example of this MRF configuration to visually
illustrate the dependence assumptions made in this design. Semantically, this
can be viewed as making the same independence assumptions made in the
vector space model that LSA utilizes. Specifically, we view each document
as only being dependent on the terms it contains. In this way, it is equivalent
to the vector-space (bag-of-words) model that stores term counts without
any dependence information.

2.2.2 Clique Definition 

One benefit to the structure we have defined is that it lends itself to easily
factored cliques with semantic meaning. There are three types of cliques in
this graph: $C = \{T, D, T \times D\}$. Cliques over T and D are simple cliques
consisting of individual documents and terms, while cliques over $T \times D$ are
pairs representing term occurrences.

When producing clique functions, the singleton cliques (T and D)
provide an opportunity to weight the importance of terms or documents to the
joint distribution. The pairwise cliques ($T \times D$) allow the "compatibility"
of documents and terms to have an effect on the distribution.

2.2.3 Clique Potential Functions 

The simplest clique potential function taking the set of random variables X
that may be expressed is the sum of the single and double member clique
potential functions:

$$
V(X) = \sum_i x_i V_i(x_i) + \sum_i \sum_j x_i x_j V_{ij}(x_i, x_j) \qquad (11)
$$

One benefit to giving our random variables binary values is that it allows this
expression to be simplified greatly without losing any generality. Any clique
whose potential function is $V(x_i) = x_i V_i(x_i)$ can only take two values:
0 or $V_i(x_i)$. Furthermore, if we declare that single clique functions evaluate to
members of parameter vectors b and g such that $V_i(t_i) = b_i$ and $V_i(d_i) = g_i$,
then $t_i V_i(t_i) = 0$ or $b_i$ and $d_i V_i(d_i) = 0$ or $g_i$. Similarly, if potential
functions for double member cliques $(t_i, d_j)$ evaluate to members of parameter
matrix W such that $V_{ij}(t_i, d_j) = W_{ij}$, the expression $t_i d_j V_{ij}(t_i, d_j) = 0$ or
$W_{ij}$. Given this flexible representation for individual clique potential
functions, the sum of all clique potential functions in equation (11) required for
the joint distribution may be written without any loss in generality as:

$$
V(X) = \sum_i b_i t_i + \sum_j g_j d_j + \sum_i \sum_j W_{ij} t_i d_j \qquad (12)
$$

It will occasionally be convenient to notate this function in terms of
vectors T and D mentioned in equations (9) and (10). This can be done as such:

$$
V(X) = b T^T + g D + T W D \qquad (13)
$$

2.2.4 Joint Distribution

Now that families of cliques have been defined and given potential functions,
an equation for the joint distribution of the MRF model X may be written,
using equation (8), as:

$$
P(X) = \frac{\exp\left(\sum_i^n b_i t_i + \sum_j^m g_j d_j + \sum_i^n \sum_j^m W_{ij} t_i d_j\right)}{\sum_{y \in S} \exp\left(\sum_i^n b_i t_i^y + \sum_j^m g_j d_j^y + \sum_i^n \sum_j^m W_{ij} t_i^y d_j^y\right)} \qquad (14)
$$

where $t_i^y$ and $d_j^y$ denote the values of the variables in configuration y.
Note that the potentials enter the exponent of equation (14) with a positive
sign, so the minus sign of the Gibbs exponent in equation (8) is treated as
absorbed into the parameters b, g, and W.
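For a very small model the joint distribution of equation (14) can be computed by brute force, which makes the role of the normalization constant explicit. The sketch below assumes the positive-exponent convention of equation (14); the parameter values and the function names `potential` and `joint_distribution` are hypothetical.

```python
import itertools
import numpy as np

def potential(t, d, b, g, W):
    """V(X) = b.T + g.D + T W D for binary term vector t and document vector d."""
    return b @ t + g @ d + t @ W @ d

def joint_distribution(b, g, W):
    """Enumerate every binary configuration and normalize exp(V) by Z."""
    n, m = W.shape  # n term variables, m document variables
    configs = [np.array(c) for c in itertools.product([0, 1], repeat=n + m)]
    weights = np.array([np.exp(potential(c[:n], c[n:], b, g, W)) for c in configs])
    Z = weights.sum()  # feasible only because the model is tiny
    return configs, weights / Z

# Assumed toy parameters: 2 terms, 1 document.
b = np.array([0.1, -0.2])
g = np.array([0.3])
W = np.array([[1.0], [-0.5]])
configs, probs = joint_distribution(b, g, W)
```

The enumeration visits $2^{n+m}$ configurations, which illustrates why Z is generally infeasible to calculate for a realistic collection and why the local probabilities of the next section are used instead.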

2.2.5 Local Probabilities 

For information retrieval, local probabilities for individual random variables
must be defined. In particular, this is necessary to find the probability of a
particular document $d_i$ given a set of query terms. For the manipulations
required to demonstrate the derivation of this probability, some compact
notations will be adopted for the sake of brevity and clarity.

• The expression $P(X_i = 1)$ denotes the probability of some binary
variable, either $t_i$ or $d_i$, taking on the value 1.

• The expression $P(X_{-i})$ denotes the joint probability of every value in X
except for $X_i$, or $P(X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_d)$.

• The expression $P(X^{i=k})$ denotes the joint probability of X such that
$X_i = k$, or $P(X_1, \ldots, X_i = k, \ldots, X_d)$.

The desired probability may be stated as:

$$
P(d_i = 1 \mid X_{-i})
$$

More clearly, this is equivalent to:

$$
P(d_i = 1 \mid t_1, \ldots, t_n, d_1, \ldots, d_{i-1}, d_{i+1}, \ldots, d_m)
$$

To begin obtaining this probability, it must first be rewritten using a more
general form with the compact notation provided above as $P(X_i = 1 \mid X_{-i})$.
This can be manipulated with the following steps:

$$
P(X_i = 1 \mid X_{-i}) = \frac{P(X_i = 1, X_{-i})}{P(X_{-i})} = \frac{P(X^{i=1})}{P(X^{i=1}) + P(X^{i=0})} = \frac{1}{1 + \frac{P(X^{i=0})}{P(X^{i=1})}}
$$



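When the joint probability is plugged in, the Z normalization constants
cancel. With the potentials entering the exponent with a positive sign, as in
equation (14), this gives:

$$
\frac{1}{1 + \frac{P(X^{i=0})}{P(X^{i=1})}} = \frac{1}{1 + \frac{\exp(V(X^{i=0}))}{\exp(V(X^{i=1}))}} = \frac{1}{1 + \exp(-[V(X^{i=1}) - V(X^{i=0})])}
$$

This takes the form of the sigmoid function, $\varsigma(t) = \frac{1}{1 + e^{-t}}$. It can
thus be written as:

$$
P(X_i = 1 \mid X_{-i}) = \varsigma(V(X^{i=1}) - V(X^{i=0})) \qquad (15)
$$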

In order to write $V(X^{i=1}) - V(X^{i=0})$ in terms of individual random
variables and parameters, it is necessary to make several observations about
the potential functions. When $X_i = 0$, the $X_i$ value and its
associated parameters make no contribution to the sum in its family's clique
potential function as written in equation (12). It can therefore be written, in
the special case considered here in which $X_i$ is a document variable:

$$
V(X^{i=0}) = \sum_n b_n t_n + \sum_{m \neq i} g_m d_m + \sum_n \sum_{m \neq i} W_{nm} t_n d_m \qquad (16)
$$

Likewise, when $X_i = 1$ and $X_i$ is a document
variable, the clique potential function for that MRF is:

$$
V(X^{i=1}) = \sum_n b_n t_n + \sum_{m \neq i} g_m d_m + g_i + \sum_n W_{ni} t_n + \sum_n \sum_{m \neq i} W_{nm} t_n d_m \qquad (17)
$$

When these are plugged into equation (15), the shared terms cancel to
produce the desired probability in terms of variables and parameters:

$$
P(D_i = 1 \mid X_{-i}) = \varsigma\left(g_i + \sum_{l=1}^{n} W_{li} t_l\right) \qquad (18)
$$

This can be represented more concisely using vectors as:

$$
P(D_i = 1 \mid X_{-i}) = \varsigma(g_i + W_i^T T^T) \qquad (19)
$$

where $W_i^T$ indicates the transpose of the $i$th column vector of parameter
matrix W.
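The closed form of equation (19) can be compared against a direct evaluation of the conditional probability from unnormalized joint weights. The following sketch does this for randomly drawn parameters; the sizes, the seed, and the helper names are assumptions made for illustration, and the positive-exponent convention of equation (14) is used.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def local_probability(i, t, g, W):
    """Equation (19): P(D_i = 1 | rest) = sigmoid(g_i + W_i^T T^T)."""
    return sigmoid(g[i] + W[:, i] @ t)

def brute_force_conditional(i, t, d, b, g, W):
    """Same probability from unnormalized joint weights exp(V(X)); Z cancels."""
    def weight(d_vec):
        return np.exp(b @ t + g @ d_vec + t @ W @ d_vec)
    d1, d0 = d.copy(), d.copy()
    d1[i], d0[i] = 1, 0
    return weight(d1) / (weight(d1) + weight(d0))

rng = np.random.default_rng(3)
n, m = 4, 3  # 4 term variables, 3 document variables (toy sizes)
b, g, W = rng.normal(size=n), rng.normal(size=m), rng.normal(size=(n, m))
t = rng.integers(0, 2, size=n)  # binary term configuration
d = rng.integers(0, 2, size=m)  # binary document configuration

p_closed = local_probability(1, t, g, W)
p_brute = brute_force_conditional(1, t, d, b, g, W)
```

The two values agree because the term parameters b and the contributions of the other documents appear in both $V(X^{i=1})$ and $V(X^{i=0})$ and therefore cancel, leaving only $g_i$ and the weights that connect document i to the active terms.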

2.2.6 Learning 

The data that will be used to train the model's parameters will be a set of
observation vectors $T^1, \ldots, T^n$ that represent occurrence vectors from the
data collection. $T^i_j$ may indicate the number of times that term j is present
in document i, but normalized counts such as tf-idf vectors are frequently
preferable.

Let us also define a matrix

$$
\tilde{T} = \begin{bmatrix} T^1 & \cdots & T^n \\ 1 & \cdots & 1 \end{bmatrix}
$$

that represents the co-occurrence matrix with a row of 1s appended to the
bottom. This can be viewed as a global term that is always on, which will be
used to estimate parameter g.

The approach for learning parameters will be the minimization of the
following sum squared error objective function:

$$
E(W, g) = \left\| I - [W \; g] \tilde{T} \right\|_F^2 \qquad (20)
$$

where $\|X\|_F$ indicates the Frobenius norm of some matrix X, and I is
an n-dimensional identity matrix whose row vectors each represent a
configuration of the MRF in which the variable corresponding with observed
occurrence vector $T^i$ is set to 1.

The method of minimizing this will be to solve the following equation:

$$
[W \; g] \tilde{T} = I
$$

The solution is obtained as:

$$
[W \; g] = I \tilde{T}^{\dagger} \qquad (21)
$$

The term $\tilde{T}^{\dagger}$ denotes the Moore-Penrose pseudoinverse of matrix $\tilde{T}$. The
expression $\tilde{T}_k^{\dagger}$ can be calculated by using a singular value decomposition of
$\tilde{T}$ keeping k singular values to obtain matrices $U_k$, $S_k$, and $V_k$. The
pseudo-inverse may now be calculated as:

$$
\tilde{T}_k^{\dagger} = V_k S_k^{-1} U_k^T \qquad (22)
$$

It is at this point that the comparison to LSA's rank reduction can be
drawn. In this context, the row vectors of the $[W \; g]$ parameter span a
k-dimensional subspace, where k is the number of singular values that have not
been set to zero by the SVD operation. It can be shown [4] that this
procedure results in finding the $[W \; g]$ that minimizes the sum squared objective
function that predicts I from $\tilde{T}$ using the formula $I = [W \; g] \tilde{T}$, subject
to the constraint that $[W \; g]$ has rank k. This means that the subspace
spanned by the row vectors of $[W \; g]$ is reduced in dimensionality in the
same way the latent space used to compare documents in LSA is reduced.
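The learning step can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the code used for the experiments: the matrix is random toy data, `learn_parameters` is a hypothetical name, and k is assumed not to exceed the number of nonzero singular values.

```python
import numpy as np

def learn_parameters(T, k):
    """Solve [W g] T~ = I with a rank-k pseudoinverse, as in equations (21) and (22).

    T is a term-by-document matrix (one column per observed document).
    The returned W has one row per document, so row i holds the weights
    that connect document i to every term.
    """
    n_terms, n_docs = T.shape
    T_tilde = np.vstack([T, np.ones((1, n_docs))])  # append the always-on row
    U, s, Vt = np.linalg.svd(T_tilde, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    # Assumes the k kept singular values are all nonzero.
    pinv_k = Vt_k.T @ np.diag(1.0 / s_k) @ U_k.T    # V_k S_k^{-1} U_k^T
    Wg = np.eye(n_docs) @ pinv_k                    # [W g] = I T~^+
    return Wg[:, :-1], Wg[:, -1]

def document_probabilities(t, W, g):
    """Equation (19) evaluated for every document given term vector t."""
    return 1.0 / (1.0 + np.exp(-(g + W @ t)))

rng = np.random.default_rng(4)
T = rng.random((6, 4))  # toy data: 6 terms, 4 documents
W, g = learn_parameters(T, k=4)
```

When all singular values are kept and the columns of the augmented matrix are linearly independent, the learned parameters reproduce the identity exactly; choosing a smaller k trades that exact fit for the smoothing effect described above.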



3 Experiments 
3.1 Method 

The goal of these experiments is to validate the novel approach we have
described by comparing its performance to popular retrieval methods. In
particular, we will be looking at various information retrieval metrics and
comparing them for varying numbers of singular values taken to reduce the
LSA co-occurrence matrix or to solve equation (21) for the MRF approach. In
addition, simple vector space term matching will be used as a baseline to
evaluate the contribution of term generalization to the algorithms' performance.
Since the most obvious algorithm with which to compare our MRF model is the
popular Latent Semantic Analysis approach described in section 1.1.1, it will
provide a good baseline for term generalization.

3.1.1 Data Sets 

The text collections chosen for this paper are the four widely used collec- 
tions that, together, comprise the Classic4 data set. The four collections 
comprising Classic4 are: 

• CRAN - 3204 abstracts from the Cranfield Institute of Technology 

• CACM - 1460 abstracts from the CACM Journal

• CISI - 1460 abstracts from the Institute for Scientific Information

• MED - 1033 abstracts from the National Library of Medicine 

Each collection comes with a set of queries and relevance judgments. 
This data set was selected based on the quality of the text and query infor- 
mation given as well as its contents. Academic abstracts would seem to be 
excellent targets for topic generalization because effective topic generaliza- 
tion manages to resolve the differing jargon that is used in similar academic 
fields. This particular data set has also been extensively studied in the past
for similar document retrieval approaches such as LSA [1], [15].

3.1.2 Procedure 

Document Collection 

The document collection on which experiments were performed was a 
combined dataset of the four Classic4 document collections. Short terms 
(below 3 characters), as well as common terms (appearing in 95% or more 
documents) were excluded. Stemming was done with the popular Porter's 
stemming algorithm [15]. 
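The term filtering described above can be sketched as follows. The function name and the toy documents are assumptions, and the stemming step is deliberately left out because it depends on an external Porter stemmer implementation.

```python
def filter_vocabulary(tokenized_docs, min_length=3, max_doc_fraction=0.95):
    """Keep terms with at least min_length characters that appear in fewer
    than max_doc_fraction of the documents."""
    n_docs = len(tokenized_docs)
    doc_freq = {}
    for tokens in tokenized_docs:
        for term in set(tokens):  # count each term at most once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {
        term
        for term, df in doc_freq.items()
        if len(term) >= min_length and df / n_docs < max_doc_fraction
    }

# Toy, already tokenized documents (assumed data).
docs = [["the", "flow", "of", "air"], ["the", "wing", "is"], ["the", "air"]]
vocabulary = filter_vocabulary(docs)
```

In this example "of" and "is" are dropped for being shorter than three characters and "the" is dropped because it occurs in every document.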

Vector Space Model 

The simplest baseline for experimentation is simple tf-idf term
matching using vector space methods. Documents are ranked based on their
angular difference from queries in document-term vector space. The method
used involved ranking by highest cosine of the angle, using equation (1) given
during the description of this approach previously.

Latent Semantic Analysis 

Document ranking with LSA follows the procedure outlined in section
1.1.1. Specifically, the data collection was loaded as a term-document matrix
with tf-idf adjustments. Then a singular value decomposition was done,
$X = U S V^T$, where X is the co-occurrence matrix. Each query $q_j$ was
mapped into the latent space query $L_j$ as $L_j = q_j^T U S^{-1}$. Comparisons
with the document collection for query j were then done by finding the
maximum cosine of the angle between latent document $V_i$ and latent query $L_j$
for each document i. This can be calculated as:

$$
\cos(\theta) = \frac{V_i \cdot L_j}{\|V_i\| \, \|L_j\|}
$$

The role of the number of values kept from the singular value
decomposition is first tested by finding the ideal number of values to keep when
decomposing the co-occurrence matrix. Since the style of queries for each
collection differs somewhat (samples are given in Appendix B), it is
necessary to view the different mean average precision values for each collection's
queries.

After this, precision-recall graphs are made using average precision over
the set of all queries for each individual document collection.

Markov Random Field Model

Document ranking was done by loading the data collection as a
term-document matrix with tf-idf adjustments and then applying the methods
described in section 2 of this paper to obtain the parameters of the MRF. No
weighting is done; the co-occurrence matrix simply records term counts.
The formula in equation (19) is then used to obtain the probability of a
certain document given the terms of the MRF, which are set to match the sample
queries given with the collections.

The role of the singular value decomposition is first tested by finding 
the ideal number of singular values to keep when learning MRF parameters 
using a method similar to the previous LSA experiment using mean average 
precision for each collection's set of queries. 
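Mean average precision can be computed as in the sketch below. The text does not state which variant of the measure was used, so the standard non-interpolated definition is assumed here; the function names and the example rankings are illustrative.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Non-interpolated average precision of one ranked list (assumed definition)."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant document
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)

def mean_average_precision(rankings, judgments):
    """Mean of the per-query average precision values."""
    aps = [average_precision(r, j) for r, j in zip(rankings, judgments)]
    return sum(aps) / len(aps) if aps else 0.0

# Toy example: two queries with made-up rankings and relevance judgments.
map_score = mean_average_precision([[3, 1, 2], [1, 2]], [[3, 2], [2]])
```

Repeating this computation for each candidate number of singular values gives one score per setting, from which the best count for a collection can be selected.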

Once this is done, it is possible to select good singular value counts for 
each query collection and create precision-recall graphs based on the average 
precision values for each set of queries. 

3.2 Results 

The results for the mean precision versus singular values taken tests (for
both LSA and MRF model forms of rank reduction) are shown in Figures 2
through 9 for the four text collections. Due to the granularity of the mean
average precision differences between differing numbers of values kept, as well as
the large difference between mean average precision values across document
collections, each document collection's graph will be shown independently.

Figure 2: MED Collection - Mean Precision for Varying Numbers of Singular
Values Used (LSA)

Figure 3: CRAN Collection - Mean Precision for Varying Numbers of Singular
Values Used (LSA)

Figure 4: CISI Collection - Mean Precision for Varying Numbers of Singular
Values Used (LSA)

Figure 5: CACM Collection - Mean Precision for Varying Numbers of Singular
Values Used (LSA)

Figure 6: MED Collection - Mean Precision for Varying Numbers of Singular
Values Used (MRF)

Figure 7: CRAN Collection - Mean Precision for Varying Numbers of Singular
Values Used (MRF)

Figure 8: CISI Collection - Mean Precision for Varying Numbers of Singular
Values Used (MRF)

Figure 9: CACM Collection - Mean Precision for Varying Numbers of Singular
Values Used (MRF)



Precision-recall graphs for the four collection queries, each using the
best number of singular values found in the previous step, are given in
Figures 10 through 13. Each graph shows results for vector space indexing,
LSA, and MRF retrieval.

Figure 10: CACM Collection - Precision-Recall

Figure 11: CISI Collection - Precision-Recall

Figure 12: MED Collection - Precision-Recall

Figure 13: CRAN Collection - Precision-Recall

A visual depiction of the mean average precision for each algorithm
is shown in Figure 14.

Figure 14: Mean Average Precision Scores for the Three Approaches

3.3 Discussion

The first experimental result concerned the selection of optimal numbers of 
singular values for use in the rank reduction in LSA and the pseudo-inverse 
using the MRF method. 

For LSA, best singular value counts of 100, 600, 100, and 700 were
found for the MED, CRAN, CISI, and CACM collections respectively. It
was clear that some collections (MED and CISI) benefitted from smaller
counts, while it took much larger counts for CRAN and CACM. However,
these are still fractions of the almost 6000 terms in the original data set.

For the MRF model, it seems that certain data sets were better suited to
this method than others. The MED and CISI collections had maximums at low
(200) singular value counts. CRAN took 900 singular values before tapering off in
performance. CACM did not seem suited to the reduced dimensionality, as it
continued to increase in performance after reaching around a fifth of possible
singular values (1200 out of 5896).

Precision-recall graphs show promise in the MRF method. It succeeds
remarkably in querying CISI, where LSA has been known to show
significantly worse performance than simple vector space methods [1]. For the
CACM and CRAN collections, it outperformed LSA and either matched or
outperformed vector space methods. The only collection in which LSA
was strictly superior was the MED collection. It is not entirely clear why
this is the case, although the MED collection is the smallest of the
collections and has a very small query collection, so it is possible that some aspect
of this unusual collection produced such polarized results. However, even in
this case, the MRF method still outperformed simple vector space methods.

Not only do these results suggest that our approach is sound for
information retrieval, but they also give credibility to our previous assertion that
the benefits from rank reduction in LSA can be matched by reducing the
dimensionality of the MRF parameter matrix W.

4 Conclusion 

4.1 Summary 

4.1.1 Theory 

In this paper, we have presented a methodical approach to defining a Markov 
Random Field (MRF) that captures the independence assumptions made in 
document indexing with Latent Semantic Analysis (LSA). A clearly defined 
graph structure produces a set of semantically meaningful clique potential 
functions describing the compatibility of documents, terms, and document- 
term pairs in the model. 

After declaring these properties of our graph, we utilized the Hammersley- 
Clifford theorem to state that the joint distribution of the random variables 
in our graph is defined with a Gibbs distribution. Some manipulation of 
probabilities was done to find a concise expression for the probability of any 
particular document given a set of terms. 

Finally, a method for learning parameters was proposed. This method 
minimizes a sum squared error using the Moore-Penrose pseudoinverse. Because 
this pseudoinverse relies on a singular value decomposition to produce 
the desired parameters, it is possible to limit the number of singular values 
used and achieve the same benefits as the rank reduction in LSA. 
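
The rank-limited pseudoinverse can be sketched as follows. This is an 
illustration under stated assumptions, not the implementation used in the 
experiments: NumPy is assumed, the names truncated_pinv and 
fit_least_squares are hypothetical, and X and Y are placeholders for 
whatever input and target matrices the learning step defines. 

```python
import numpy as np


def truncated_pinv(A, k, tol=1e-12):
    """Pseudoinverse of A built from only its k largest singular values."""
    # Thin SVD: A = U @ diag(s) @ Vt, with s sorted in descending order.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Never invert singular values that are numerically zero.
    k = min(k, int(np.sum(s > tol)))
    # A_k^+ = V_k @ diag(1 / s_k) @ U_k^T
    return (Vt[:k].T / s[:k]) @ U[:, :k].T


def fit_least_squares(X, Y, k):
    """Least-squares parameters for X @ W ~ Y using a rank-k pseudoinverse.

    X and Y are hypothetical placeholders for the training matrices.
    """
    return truncated_pinv(X, k) @ Y


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(6, 4))
    Y = rng.normal(size=(6, 2))
    for k in (1, 2, 4):
        W = fit_least_squares(X, Y, k)
        print(k, float(np.sum((X @ W - Y) ** 2)))
```

When k equals the rank of X, this reduces to the ordinary Moore-Penrose 
solution. Smaller values of k discard the directions associated with the 
smallest singular values, which is the same truncation that LSA applies. 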

4.1.2 Results 

Experiments were carried out on the medium-sized Classic4 data set of 
scientific abstracts. The results showed that, as with LSA, the number of 
singular values kept in the rank reduction affects performance. Once the 
most effective number of singular values for each collection was determined, 
the queries for that collection were executed on an MRF formed by learning 
with that number of singular values. Average precision-recall graphs for the 
MRF approach, as well as for the LSA and vector space methods, were 
constructed for each set of queries; these showed effective retrieval by 
the MRF method. 

The results of these queries were promising. Even though CISI was 
previously described as being difficult, precision scores remained above 0.2 
for all recall values. Both MED and CRAN collections produced excellent 
results, with 0.4755 and 0.3184 mean average precision scores respectively. 
CISI produced a mean average precision of 0.3817, a surprisingly high score 
for such a difficult collection. The most difficult collection with this method 
proved to be CACM with a score of 0.3119, but that is not significantly 
lower than the others. LSA was only able to outperform our approach on the 
small MED collection, but the MRF model outperformed LSA on the other 
3 collections. The efficacy of our method as a document retrieval engine for 
difficult collections is suggested by these results. 
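
For reference, a mean average precision score can be computed as sketched 
below. The sketch assumes the common non-interpolated definition of average 
precision and that each ranking lists a document at most once. Whether the 
scores reported above use exactly this variant is an assumption, and the 
function names are hypothetical. 

```python
def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated average precision for a single query."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank of each relevant document retrieved.
            total += hits / rank
    # Relevant documents that are never retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(rankings, judgments):
    """Mean of the per-query average precision over all queries."""
    scores = [average_precision(r, j) for r, j in zip(rankings, judgments)]
    return sum(scores) / len(scores) if scores else 0.0
```

For example, a ranking a, b, c, d with relevant documents a and c has 
precision 1/1 at the first relevant document and 2/3 at the second, giving 
an average precision of 5/6. 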



4.2 Uses and Extensions 

The greatest benefit of our approach is its potential for future expansion. 
Now that a clear statistical model has been proposed that utilizes rank reduc- 
tion in a similar manner to LSA, the next step will be to add new assumptions 
to the MRF model that produce more intelligent results. Term dependencies, 
hierarchical document structures, and query expansion are several ideas for 
future research with this approach. 



A Sample Documents from Classic4 Data Set 

A.1 CRAN 

experimental investigation of the aerodynamics 
of a wing in a slipstream. an experimental 
study of a wing in a propeller slipstream was 
made in order to determine the spanwise distribution 
of the lift increase due to slipstream at different 
angles of attack of the wing and at different 
free stream to slipstream velocity ratios. the 
results were intended in part as an evaluation 
basis for different theoretical treatments of 
this problem. the comparative span loading 
curves, together with supporting evidence, showed 
that a substantial part of the lift increment 
produced by the slipstream was due to a /destalling/ 
or boundary-layer-control effect. the integrated 
remaining lift increment, after subtracting 
this destalling lift, was found to agree well 
with a potential flow theory. an empirical 
evaluation of the destalling effects was made 
for the specific configuration of the experiment. 



A.2 CISI 



The present study is a history of the DEWEY 
Decimal Classification. The first edition of 
the DDC was published in 1876, the eighteenth 
edition in 1971, and future editions will continue 
to appear as needed. In spite of the DDC s 
long and healthy life, however, its full story 
has never been told. There have been biographies 
of Dewey that briefly describe his system, but 
this is the first attempt to provide a detailed 
history of the work that more than any other 
has spurred the growth of librarianship in this 
country and abroad. 



A.3 CACM 

This paper discusses the limited problem of 
recognition and retrieval of a given misspelled 
name from among a roster of several hundred 
names, such as the reservation inventory for 
a given flight of a large jet airliner. A program 
has been developed and operated on the Telefile 
(a stored-program core and drum memory solid-state 
computer) which will retrieve passengers' records 
successfully, despite significant misspellings 
either at original entry time or at retrieval 
time. The procedure involves an automatic scoring 
technique which matches the names in a condensed 
form. Only those few names most closely resembling 
the requested name, with their phone numbers 
annexed, are presented for the agents final 
manual selecton. The program has successfully 
isolated and retrieved names which were subjected 
to a number of unusual (as well as usual) misspellings. 



A.4 MED 

correlation between maternal and fetal plasma 
levels of glucose and free fatty acids. correlation 
coefficients have been determined between the 
levels of glucose and ffa in maternal and fetal 
plasma collected at delivery. significant correlations 
were obtained between the maternal and fetal 
glucose levels and the maternal and fetal ffa 
levels. from the size of the correlation coefficients 
and the slopes of regression lines it appears 
that the fetal plasma glucose level at delivery 
is very strongly dependent upon the maternal 
level whereas the fetal ffa level at delivery 
is only slightly dependent upon the maternal 
level . 

B Sample Queries from the Classic4 Data Set 

B.1 CRAN 

B.1.1 Query 1 of 365 

what similarity laws must be obeyed when constructing 
aeroelastic models of heated high speed aircraft. 

B.1.2 Query 2 of 365 

what are the structural and aeroelastic problems 
associated with flight of high speed aircraft. 

B.2 CISI 

B.2.1 Query 1 of 112 

What problems and concerns are there in making 
up descriptive titles? What difficulties are 
involved in automatically retrieving articles 
from approximate titles? What is the usual 
relevance of the content of articles to their 
titles? 

B.2.2 Query 2 of 112 

How can actually pertinent data, as opposed 
to references or entire articles themselves, 
be retrieved automatically in response to information 
requests? 

B.3 CACM 

B.3.1 Query 1 of 64 

What articles exist which deal with TSS (Time 
Sharing System), an operating system for IBM 
computers? 

B.3.2 Query 2 of 64 

I am interested in articles written either by 
Prieve or Udo Pooch 

B.4 MED 

B.4.1 Query 1 of 30 

the crystalline lens in vertebrates, including 
humans . 

B.4.2 Query 2 of 30 

the relationship of blood and cerebrospinal 
fluid oxygen concentrations or partial pressures, 
a method of interest is polarography . 